"Come on, I worked so hard on this project! And this is publicly accessible data! There's certainly a way around this, right? Or else, I did all of this for nothing... Sigh..."
Yep - this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.
Interestingly, I've been seeing more and more projects like mine lately. And even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appallingly widespread ignorance of its legal aspects.
So this is what this post is all about - understanding the possible consequences of web scraping and crawling. Hopefully, this will help you avoid potential problems.
Disclaimer: I'm not a lawyer. I'm simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.
What are web scraping and crawling?
Let's first define these terms to make sure that we're on the same page.
- Web scraping: the act of automatically downloading a web page's data and extracting very specific information from it. The extracted information can be stored pretty much anywhere (database, file, etc.).
- Web crawling: the act of automatically downloading a web page's data, extracting the hyperlinks it contains and following them. The downloaded data is generally stored in an index or a database to make it easily searchable.
For example, you may use a web scraper to extract weather forecast data from the National Weather Service. This would allow you to further analyze it.
In contrast, you may use a web crawler to download data from a broad range of websites and build a search engine. Maybe you've already heard of Googlebot, Google's own web crawler.
So web scrapers and crawlers are generally used for entirely different purposes.
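To make the distinction concrete, here's a minimal Python sketch that uses only the standard library. The names `scrape_title` and `crawl`, and the in-memory `PAGES` dictionary that stands in for the web, are my own assumptions for illustration. A real scraper or crawler would download pages over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title><a href="/b">B</a>',
    "https://example.com/b": '<title>Page B</title><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html):
    """Run the parser over one page and return it with its collected data."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def scrape_title(url):
    """Scraping: extract one specific piece of information from one page."""
    return parse(PAGES[url]).title


def crawl(start_url):
    """Crawling: follow hyperlinks breadth-first and index every page reached."""
    index = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in index or url not in PAGES:
            continue
        page = parse(PAGES[url])
        index[url] = page.title
        # Resolve relative links against the current page before following them.
        queue.extend(urljoin(url, href) for href in page.links)
    return index
```

Calling `scrape_title("https://example.com/a")` pulls a single value out of a single page, while `crawl("https://example.com/")` walks the links and ends up with an index of all three pages.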
Why is web scraping often seen negatively?
The reputation of web scraping has gotten a lot worse in the past few years, and for good reasons:
- It's increasingly being used for business purposes to gain a competitive advantage. So there's often a financial motive behind it.
- It's often done in complete disregard of copyright laws and of Terms of Service (ToS).
- It's often done in an abusive manner. For example, web scrapers might send many more requests per second than a human would, causing an unexpected load on websites. They might also choose to stay anonymous and not identify themselves. Finally, they might perform prohibited operations on websites, like circumventing security measures in order to automatically download data that would otherwise be inaccessible.
Tons of individuals and companies are running their own web scrapers right now. So much so that this has been causing headaches for companies whose websites are scraped, like social networks (e.g. Facebook, LinkedIn, etc.) and online stores (e.g. Amazon). This is probably why Facebook has separate terms for automated data collection.
In contrast, web crawling has historically been used by the well-known search engines (e.g. Google, Bing, etc.) to download and index the web. These companies have built a good reputation over the years, because they've built indispensable tools that add value to the websites they crawl. So web crawling is generally seen more favorably, although it may sometimes be used in abusive ways as well.
So is it legal or illegal?
Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch.
The problem arises when you scrape or crawl the website of somebody else, without obtaining their prior written permission, or in disregard of their Terms of Service (ToS). You're essentially putting yourself in a vulnerable position.
Just think about it; you're using the bandwidth of somebody else, and you're freely retrieving and using their data. It's reasonable to think that they might not like it, because what you're doing might hurt them in some way. So depending on many factors (and what mood they're in), they're perfectly free to pursue legal action against you.
I know what you may be thinking. "Come on! This is ridiculous! Why would they sue me?". Sure, they might just ignore you. Or they might simply use technical measures to block you. Or they might just send you a cease and desist letter. But technically, there's nothing that prevents them from suing you. This is the real problem.
Need proof? In LinkedIn v. Doe Defendants, LinkedIn is suing between 1 and 100 people who anonymously scraped its website. And for what reasons is it suing those people? Let's see:
- Violation of the Computer Fraud and Abuse Act (CFAA).
- Violation of California Penal Code.
- Violation of the Digital Millennium Copyright Act (DMCA).
- Breach of contract.
- Trespass.
- Misappropriation.
That lawsuit is pretty concerning, because it's really not clear what will happen to those "anonymous" people.
Consider that if you ever get sued, you can't simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it's fair, or whether or not what you did is really illegal.
Another problem is that the law probably isn't like anything you're used to. Where you use logic, common sense and your technical expertise, the other side will use legal jargon and some grey areas of law to prove that you did something wrong. This isn't a level playing field. And it certainly isn't a good situation to be in. So you'll need to get a lawyer, and this might cost you a lot of money.
Besides, based on the above lawsuit by LinkedIn, you can see that cases can undoubtedly become quite complex and very broad in scope, even though you "just scraped a website".
The typical counterarguments people bring up
I found that people generally try to defend their web scraping or crawling activities by downplaying their importance. And they do so typically by using the same arguments over and over again.
So let's review the most common ones:
"I can do whatever I want with publicly accessible data."
False. The problem is that the "creative arrangement" of data can be copyrighted, as described on cendi.gov:
Facts cannot be copyrighted. However, the creative selection, coordination and arrangement of information and materials forming a database or compilation may be protected by copyright. Note, however, that the copyright protection only extends to the creative aspect, not to the facts contained in the database or compilation.
So a website - including its pages, design, layout and database - can be copyrighted, because it's considered a creative work. And if you scrape that website to extract data from it, the simple fact of copying a web page in memory with your web scraper might be considered a copyright violation.
In the United States, copyrighted work is protected by federal copyright law, which includes the Digital Millennium Copyright Act (DMCA).
"This is fair use!"
This is a grey area:
- In Kelly v. Arriba Soft Corp., the court found that the image search engine Ditto.com made fair use of a professional photographer's pictures by displaying thumbnails of them.
- In Associated Press v. Meltwater U.S. Holdings, Inc., the court found that Meltwater's news aggregator service didn't make fair use of Associated Press' articles, even though scraped articles were only displayed as excerpts of the originals.
"It's the same as what my browser already does! Scraping a site is not technically different from using a web browser. I could gather data manually, anyway!"
False. Terms of Service (ToS) often contain clauses that prohibit crawling/scraping/harvesting and automated uses of their associated services. You're legally bound by those terms; it doesn't matter that you could get that data manually.
"The worst that might happen if I break their Terms of Service is that I might get banned or blocked."
This is a grey area:
- In Facebook v. Pete Warden, Facebook's attorney threatened to sue Mr. Warden if he published his dataset comprised of hundreds of millions of scraped Facebook profiles.
- In LinkedIn Corporation v. Michael George Keating, LinkedIn blocked Mr. Keating from accessing LinkedIn because he had created a tool that they thought was made to scrape their website. They were wrong. Yet he has never been able to restore his account. Fortunately, this case didn't go further.
- In LinkedIn Corporation v. Robocog Inc, Robocog Inc. (a.k.a. HiringSolved) was ordered to pay $40,000 to LinkedIn for its unauthorized scraping of the site.
"This is completely unfair! Google has been crawling/scraping the whole web since forever!"
True. But the law apparently has nothing to do with fairness. It's based on rules, interpreted by people.
"If I ever get sued, I'll Good-Will-Hunting my way into defending myself."
Good luck! You'll need it, unless you know the law and its jargon extensively. Personally, I don't.
"But I used an automated script, so I didn't enter into any contract with the website."
This is a grey area:
- In Internet Archive v. Suzanne Shell, Internet Archive faced a breach of contract claim for copying and archiving pages from Mrs. Shell's website using its web crawlers. On her website, Mrs. Shell displays a warning stating that as soon as you copy content from her website, you enter into a contract, and you owe her US$5,000 per page copied (!!!). The two parties apparently reached an amicable resolution.
- In Southwest Airlines Co. v. BoardFirst, LLC, BoardFirst was found to have violated a browsewrap contract displayed on Southwest Airlines' website. BoardFirst had created a tool that automatically downloaded the boarding passes of Southwest's customers to offer them better seats.
"Terms of Service (ToS) are not enforceable anyway. They have no legal value."
False. The Bingham McCutchen LLP law firm published a pretty extensive article on this matter and they state that:
As is the general rule with any contract, a website's terms of use will generally be deemed enforceable if mutually agreed to by the parties. [...] Regardless of whether a website's terms of use are clickwrap or browsewrap, the defendant's failure to read those terms is generally found irrelevant to the enforceability of its terms. One court disregarded arguments that awareness of a website's terms of use could not be imputed to a party who accessed that website using a web crawling or scraping tool that is unable to detect, let alone agree, to such terms. Similarly, one court imputed knowledge of a website's terms of use to a defendant who had repeatedly accessed that website using such tools. Nevertheless, these cases are, again, intensely factually driven, and courts have also declined to enforce terms of use where a plaintiff has failed to sufficiently establish that the defendant knew or should have known of those terms (e.g., because the terms are inconspicuous), even where the defendant repeatedly accessed a website using web crawling and scraping tools.
In other words, whether Terms of Service (ToS) are legally enforced depends on the court, and on whether there's sufficient proof that you were aware of them.
"I respected their robots.txt and I crawled at a reasonable speed, so I can't possibly get into trouble, right?"
This is a grey area.
robots.txt is recognized as a "technological tool to deter unwanted crawling or scraping". But whether or not you respect it, you're still bound by the Terms of Service (ToS).
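For reference, this is roughly what respecting robots.txt looks like in practice. It's a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the bot name and the URLs are made up for illustration, and passing this check says nothing about whether the site's ToS allow your bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def check_robots(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for this user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when robots.txt has no Crawl-delay rule.
    delay = parser.crawl_delay(user_agent)
    return allowed, delay


# "MY-BOT" is an assumed bot name.
public = check_robots(ROBOTS_TXT, "MY-BOT", "https://example.com/public/page.html")
private = check_robots(ROBOTS_TXT, "MY-BOT", "https://example.com/private/data.html")
```

Here the public page is allowed, the page under `/private/` is not, and both come back with the crawl delay declared in the file. For a live site you would let the parser download the file with `set_url()` and `read()` instead of passing the text in.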
"Okay, but this is for personal use. For my personal research only. I won't re-publish it, or publish any derivative dataset, or even sell it. So I'm good to go, right?"
This is a grey area. Terms of Service (ToS) often prohibit automatic data collection, for any purpose.
According to the Bingham McCutchen LLP law firm:
The terms of use for websites frequently include clauses prohibiting access or use of the website by web crawlers, scrapers or other robots, including for purposes of data collection. Courts have recognized causes of action for breaches of contract based on the use of web crawling or scraping tools in violation of such provisions.
"But the website has no robots.txt. So I can do what I want, right?"
False. You're still bound by the Terms of Service (ToS), and the content is copyrighted.
General advice for your scraping or crawling projects
Based on the above, you can certainly guess that you should be extra cautious with web scraping and crawling.
Here are a few pieces of advice:
- Use an API if one is provided, instead of scraping data.
- Respect the Terms of Service (ToS).
- Respect the rules of robots.txt.
- Use a reasonable crawl rate, i.e. don't bombard the site with requests. Respect the crawl-delay setting provided in robots.txt; if there's none, use a conservative crawl rate (e.g. 1 request per 10-15 seconds).
- Identify your web scraper or crawler with a legitimate user agent string. Create a page that explains what you're doing and why, and link back to the page in your user agent string (e.g. 'MY-BOT (+https://yoursite.com/mybot.html)').
- If ToS or robots.txt prevent you from crawling or scraping, ask the owner of the site for written permission before doing anything else.
- Don't republish your crawled or scraped data or any derivative dataset without verifying the license of the data, or without obtaining written permission from the copyright holder.
- If you doubt the legality of what you're doing, don't do it. Or seek the advice of a lawyer.
- Don't base your whole business on data scraping. The website(s) that you scrape may eventually block you, just like what happened in Craigslist Inc. v. 3Taps Inc.
- Finally, you should be suspicious of any advice that you find on the internet (including mine), so please consult a lawyer.
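The two most mechanical pieces of advice above, a conservative crawl rate and an honest user agent, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Throttle` class, `build_request` and `polite_fetch` are hypothetical names, and the user agent string is the placeholder from the list above.

```python
import time
import urllib.request

# Hypothetical bot name and info page, following the pattern suggested above.
USER_AGENT = "MY-BOT (+https://yoursite.com/mybot.html)"


class Throttle:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the delay; return the time slept."""
        slept = 0.0
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept


def build_request(url):
    """Create a request that identifies the bot through its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_fetch(url, throttle):
    """Wait for the throttle, then download the page body as bytes."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

You would create one `Throttle(15)` per website and pass it to every `polite_fetch` call for that site, so that requests are spaced at least 15 seconds apart no matter how fast the rest of your code runs. If the site's robots.txt declares a crawl delay, use that value instead.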
Remember that companies and individuals are perfectly free to sue you, for whatever reasons they want. This is most likely not the first step that they'll take. But if you scrape/crawl their website without permission and you do something that they don't like, you definitely put yourself in a vulnerable position.
Conclusion
As we've seen in this post, web scraping and crawling aren't illegal by themselves. They might become problematic when you play on somebody else's turf, on your own terms, without obtaining their prior permission. The same is true in real life as well, when you think about it.
There are a lot of grey areas in law around this topic, so the outcome is pretty unpredictable. Before getting into trouble, make sure that what you're doing respects the rules.
And finally, the relevant question isn't "Is this legal?". Instead, you should ask yourself "Am I doing something that might upset someone? And am I willing to take the (financial) risk of their response?".
So I hope that you appreciated my post! Feel free to leave a comment in the comment section below!
This post was featured on Hacker News, Reddit, Lobsters and in the Programming Digest newsletter. Thanks to everyone for your support and feedback!